Statistical Learning Project
Introduction
Importing Dataset and uploading libraries
library(imputeMissings)
library(rmdformats)
library(GGally)
library(circlize)
library(psych)
library(pscl)
library(dendextend)
library(sjPlot)
library(gridExtra)
library(fclust)
library(FactoMineR)
library(class)
library(cluster)
library(factoextra)
library(tidyverse)
library(ggpubr)
library(caret)
library(car)
library(lawstat)
library(lmtest)
library(MASS)
library(readxl)
Efige <- read_excel("/Users/saralattarulo/Desktop/Project/data_orig.xlsx",
sheet = "all", na = "9999999999")
VARIABLES
European Firms in a Global Economy (EFIGE) is a dataset containing information on a sample of 14758 firms, for a total of 489 qualitative and quantitative variables (obtained from the survey questions), in seven EU economies:
-Germany
-France
-Italy
-Spain
-United Kingdom
-Austria
-Hungary
Data was collected in 2010, covering the years from 2007 to 2009.
Since this dataset is very large, we will need to face several issues, such as dealing with NAs and outliers and establishing which variables are the most significant for our aims.
I selected 28 variables, focusing on the first issue we could bump into: the time series. Since the data was collected over a range of years, to avoid time-series problems all my variables refer to the year 2008. As a consequence, section C, on investments in R&D activities, has been excluded.
Section A - STRUCTURE OF THE FIRM:
COUNTRY: country of the firm (categorical) country
PAVITT: Classification of the firm (categorical) pavitt
A1: year of establishment of the firm (categorical) age
A2A: Turnover made by core product (%) core_p
A3: Annual turnover (categorical) turnov
A16: Foreign affiliates (abs) foreign_affiliates
A20: CEO is a family member (categorical) fam_ceo
A16_2: First shareholder: share of capital (%) first_share
SECTION B - WORKFORCE
B3: Total number of employees (abs) empl
b4_2_1: Entrepreneurs/executives not related to the family who owns the company (abs) no_family_n
b4_2_2: Entrepreneurs/executives related to the family who owns the company (abs) family_n
b4_2_3: White collars (abs) white_n
b4_2_4: Skilled blue collars (abs) skilled_n
b4_2_5: Unskilled blue collars (abs) unskilled_n
B5_2: Total number of employees involved in R&D activities (abs) involv_n
B6_2: Total number of university graduates in your workforce in your home country (abs) grad_n
B7_2: Total number of foreign employees in your workforce in your home country (abs) foreign_empl_n
B22: Employees that did training programs (%) training
B23: Training type (categorical) training_type
Section D – INTERNATIONALIZATION
D6: Total exporter countries (abs) exp_richness
Section F - FINANCE
F0: External financing (categorical) external_financing
F9: Number of banks f9
F10: Your firm’s total bank debit is held at your main bank (%) f10
F11: How many years has this bank been the firm’s main bank f11
Section E - MARKET & PRICING
E1: Firm’s turnover was made up by sales of produced-to order goods (%) po
Extra variables
Global_exporter: firm exporting to China, India or other Asian countries, or to the USA, Canada, Central or South America (categorical)
mkt_innov: firms that carried out new-to-market innovation (categorical)
family_tmt: family members forming the business (%)
After uploading the dataset, since we have variables of different kinds (even among the numerical ones), I will define some useful functions, such as standardization, normalization, and one to remove outliers.
# Removing outliers
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
# Population standard deviation function
PopSD = function(x){
sqrt(sum((x-mean(x))^2)/length(x))
}
# Variable standardizer function
Standardize = function(x){(x-mean(x))/PopSD(x)}
# Normalization
normalize <- function(x){
return((x-min(x))/(max(x)-min(x)))
}
The variables that we take into account are selected step by step, according to the study we are implementing.
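As a quick sanity check of these helpers on a toy vector (not EFIGE data): the extreme value 100 falls outside the 1.5 × IQR fences and is replaced by NA, while normalize maps the original range onto [0, 1].

```r
# Toy vector with one extreme value
x <- c(1, 2, 3, 4, 100)

remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

remove_outliers(x)  # 1 2 3 4 NA: only 100 is flagged as an outlier
normalize(x)        # 1 maps to 0, 100 maps to 1
```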
LET’S START
Multiple linear Regression
Multiple linear regression allows us to predict a numerical outcome using not just one predictor, but several at once.
In our case, we are trying to understand whether the number of foreign employees in Italian and French firms depends on the following variables:
Total exporter countries;
Employers involved in research and development;
Percentage of the firm's turnover made up by produced-to-order goods;
If the firm carried out any market innovation;
To represent the causal path underlying the model, I am using a DAG (Directed Acyclic Graph).
Let's extract the subset; this time I will simply drop the NAs, since we lose only a few observations.
Regression <- Efige %>%
dplyr::select(foreign_empl_n, exp_richness, no_family_n, involv_n, mkt_innov, po, country) %>%
dplyr::mutate(country = as.factor(country),
mkt_innov = as.factor(mkt_innov)) %>%
filter(country == "ITA" | country == "FRA")
Regression <- na.omit(Regression)
head(Regression)
## # A tibble: 6 x 7
## foreign_empl_n exp_richness no_family_n involv_n mkt_innov po country
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 0 25 14 17 0 100 FRA
## 2 0 5 1 1 0 100 FRA
## 3 0 4 4.62 0 0 100 FRA
## 4 3 3 3 2 0 100 FRA
## 5 0 3 0 0 0 100 FRA
## 6 1 3 25 40 0 100 FRA
## vars n mean sd median trimmed mad min max range skew
## foreign_empl_n 1 3048 3.12 13.61 1 1.23 1.48 0 500 500 21.90
## exp_richness 2 3048 10.91 14.43 6 7.96 5.93 1 200 199 4.16
## no_family_n 3 3048 6.05 23.30 1 2.17 1.48 0 700 700 14.64
## involv_n 4 3048 5.42 47.49 2 2.18 2.97 0 2500 2500 47.86
## mkt_innov* 5 3048 1.40 0.49 1 1.37 0.00 1 2 1 0.42
## po 6 3048 83.41 31.60 100 91.39 0.00 0 100 100 -1.82
## country* 7 3048 3.76 1.48 5 3.82 0.00 2 5 3 -0.35
## kurtosis se
## foreign_empl_n 672.43 0.25
## exp_richness 29.52 0.26
## no_family_n 326.87 0.42
## involv_n 2495.61 0.86
## mkt_innov* -1.82 0.01
## po 1.77 0.57
## country* -1.88 0.03
Pre-processing the data: we standardize the predictors because we want to give the same weight to all variables in the model.
Regression$foreign_empl_n<-Standardize(Regression$foreign_empl_n)
Regression$exp_richness<-Standardize(Regression$exp_richness)
Regression$po<-Standardize(Regression$po)
Regression$involv_n<-Standardize(Regression$involv_n)
head(Regression)
## # A tibble: 6 x 7
## foreign_empl_n exp_richness no_family_n involv_n mkt_innov po country
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
## 1 -0.0168 0.0677 14 0.00514 0 0.0166 FRA
## 2 -0.0168 -0.0284 1 -0.00196 0 0.0166 FRA
## 3 -0.0168 -0.0332 4.62 -0.00240 0 0.0166 FRA
## 4 -0.000627 -0.0380 3 -0.00152 0 0.0166 FRA
## 5 -0.0168 -0.0380 0 -0.00240 0 0.0166 FRA
## 6 -0.0114 -0.0380 25 0.0153 0 0.0166 FRA
At first glance we notice that we don't have linear relationships among the variables, so we will certainly need to apply some transformations.
Multiple linear Regression Model
After many trials I came up to the solution of transforming my model into a Log-lin model (with the exception of involv_n that i also transformed in log)
MLR=lm(log(foreign_empl_n)~ exp_richness + po + log(involv_n) + mkt_innov, data = Regression)
summary(MLR)
##
## Call:
## lm(formula = log(foreign_empl_n) ~ exp_richness + po + log(involv_n) +
## mkt_innov, data = Regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2957 -0.9856 -0.0254 0.7803 3.4144
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.83547 0.43045 -4.264 3.72e-05 ***
## exp_richness 2.26741 0.88978 2.548 0.01193 *
## po 0.41367 3.02275 0.137 0.89135
## log(involv_n) 0.22960 0.07076 3.245 0.00148 **
## mkt_innov1 -0.39157 0.21598 -1.813 0.07202 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.251 on 137 degrees of freedom
## (2906 observations deleted due to missingness)
## Multiple R-squared: 0.154, Adjusted R-squared: 0.1293
## F-statistic: 6.236 on 4 and 137 DF, p-value: 0.0001221
Parameters interpretations
Holding the other predictors constant, we can say:
2.26741 is the expected mean change in the log of foreign employees when the (standardized) number of total exporter countries changes by 1 standard deviation.
In this case it has a positive effect and is significant, since the p-value is less than 0.05;
0.41367 is the mean change in the log of foreign employees when the (standardized) percentage of turnover made up by produced-to-order goods changes by 1 standard deviation.
In this case it has a positive effect, but it's not significant at all, since the p-value is far greater than 0.05;
0.22960 is the mean change in the log of foreign employees when the log of the number of people involved in research and development activities changes by 1 unit.
In this case it has a positive effect and it's very significant, since the p-value is 0.00148;
-0.39157 is the average change in the log of foreign employees when firms carried out new market innovations. It has a negative effect, and it's not really significant if we take a 0.05 significance level into account.
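Since this is a log-lin model, each coefficient can also be read as a multiplicative effect on the outcome: a 1-SD increase in a predictor multiplies the expected number of foreign employees by exp(β). A quick sketch for the exp_richness coefficient, using the estimate reported in the summary above:

```r
beta_exp_richness <- 2.26741           # estimate from summary(MLR) above

exp(beta_exp_richness)                 # multiplicative effect on foreign_empl_n
(exp(beta_exp_richness) - 1) * 100     # the same effect as a percentage change
```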
The adjusted R-squared may be low for many reasons, but here it is unsurprising: we are working on a huge dataset, with more than 100 omitted variables that may be significant.
Let’s proceed with our assumptions check.
Assumptions
- Linearity assumption
Every parameter seems to have an approximately linear relationship with the outcome
- The mean of our error term is zero and the errors are normally distributed
## [1] 2.053399e-17
##
## Shapiro-Wilk normality test
##
## data: MLR$residuals
## W = 0.9852, p-value = 0.1307
The mean of our residuals is equal to 0 and the residuals are normally distributed.
- Homoscedasticity of residuals: the variance must be constant
##
## studentized Breusch-Pagan test
##
## data: MLR
## BP = 2.3463, df = 4, p-value = 0.6724
Since the p-value is greater than 0.05, we cannot reject the null hypothesis that the variance of the residuals is constant.
- Independence of errors
##
## Durbin-Watson test
##
## data: MLR
## DW = 1.8656, p-value = 0.2114
## alternative hypothesis: true autocorrelation is greater than 0
The p-value here is higher than 0.05, so we cannot reject the null hypothesis of independence of the errors.
- The x variables and residuals are uncorrelated
We must take the endogeneity problem into account here.
#H0: corr coefficient equal to 0
#length(Regression$exp_richness)
#length(MLR$residuals)
#cor(Regression$exp_richness, MLR$residuals)
#cor(Regression$no_family_n, MLR$residuals)
#cor(Regression$involv_n, MLR$residuals)
- Absence of perfect multicollinearity
## exp_richness po log(involv_n) mkt_innov
## 1.282009 1.201141 1.054852 1.027426
To get a first idea of the multicollinearity among the variables, we can plot them against each other. From this preliminary view, multicollinearity appears to be absent.
To be sure, we check the VIF (Variance Inflation Factor); values below 2 usually indicate a model without multicollinearity.
As we see, there is no multicollinearity, therefore the assumption is met.
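For intuition, the VIF of a predictor is 1/(1 − R²_j), where R²_j comes from regressing that predictor on all the others. A minimal sketch on simulated data (the variable names are illustrative, not from EFIGE):

```r
set.seed(42)
n <- 200
x2 <- rnorm(n)
x3 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)   # x1 is mildly correlated with x2

# VIF of x1: regress it on the other predictors
r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1                       # close to 1, i.e. little collinearity
```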
Logistic Regression
Logistic regression lets us predict a categorical target with more than one regressor.
In our case, we are trying to understand whether the probability of being in the group of firms exporting to China, India or other Asian countries, or to the USA, Canada, Central or South America, depends on:
Country of the firm;
If the CEO is a family member or not;
If the firm carried out market innovations
People involved in R&D;
Number of graduate employees in the firm.
In this case, after extracting the sample, we impute the NAs with the median/mode method using the function impute(), in order not to lose information.
Logistic<-Efige %>%
dplyr::select(pavitt,age,country,mkt_innov,Global_exporter,fam_ceo,involv_n,grad_n)%>%
filter(pavitt=="Specialized industries")%>%
dplyr::mutate(pavitt=as.factor(pavitt),
age=as.factor(age),
mkt_innov=as.factor(mkt_innov),
country=as.factor(country),
fam_ceo=as.factor(fam_ceo))
data<-imputeMissings::compute(Logistic, method = "median/mode")
Logistic<-imputeMissings::impute(Logistic, object = data)
Logistic <- na.omit(Logistic)
head(Logistic)
## pavitt age country mkt_innov Global_exporter
## 1 Specialized industries More than 20 yrs AUT 1 1
## 2 Specialized industries More than 20 yrs AUT 0 0
## 3 Specialized industries More than 20 yrs AUT 1 1
## 4 Specialized industries More than 20 yrs AUT 0 1
## 5 Specialized industries More than 20 yrs AUT 0 0
## 6 Specialized industries Between 20 and 6 yrs AUT 1 0
## fam_ceo involv_n grad_n
## 1 0 200.0 35.0
## 2 1 0.0 0.0
## 3 1 31.5 10.5
## 4 1 0.0 0.0
## 5 1 0.0 0.0
## 6 1 2.0 2.0
Logistic Model
After standardizing our numerical variables we are ready to build our model:
Logistic$involv_n=Standardize(Logistic$involv_n)
Logistic$grad_n=Standardize(Logistic$grad_n)
LR <- glm(Global_exporter ~ country + fam_ceo + mkt_innov + involv_n*grad_n, data=Logistic, family="binomial")
summary(LR)
##
## Call:
## glm(formula = Global_exporter ~ country + fam_ceo + mkt_innov +
## involv_n * grad_n, family = "binomial", data = Logistic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3605 -0.9375 -0.7027 1.1936 2.3409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.97704 0.28471 -3.432 0.00060 ***
## countryFRA 0.14154 0.29260 0.484 0.62857
## countryGER 0.08751 0.28735 0.305 0.76072
## countryHUN -1.29254 0.42043 -3.074 0.00211 **
## countryITA 0.85191 0.28993 2.938 0.00330 **
## countrySPA 0.49326 0.29401 1.678 0.09341 .
## countryUK 0.43084 0.29849 1.443 0.14891
## fam_ceo1 -0.23380 0.08938 -2.616 0.00890 **
## mkt_innov1 0.81650 0.08671 9.416 < 2e-16 ***
## involv_n 5.09459 1.19240 4.273 1.93e-05 ***
## grad_n 8.83055 2.01994 4.372 1.23e-05 ***
## involv_n:grad_n -16.09100 3.50387 -4.592 4.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3546.5 on 2664 degrees of freedom
## Residual deviance: 3247.9 on 2653 degrees of freedom
## AIC: 3271.9
##
## Number of Fisher Scoring iterations: 4
Parameters interpretations
Having AUT as the reference and comparing it with HUN, there is a statistically significant negative effect (-1.29254) on the log-odds of being a global exporter.
Having AUT as the reference and comparing it with ITA, there is a statistically significant positive effect (0.85191) on the log-odds of being a global exporter.
Having fam_ceo0 (the CEO is not a family member) as the reference and comparing it with fam_ceo1 (the CEO is a family member), there is a statistically significant negative effect (-0.23380) on the log-odds of being a global exporter.
Having mkt_innov0 (no market innovation) as the reference and comparing it with mkt_innov1 (market innovation), there is a statistically significant positive effect (0.81650) on the log-odds of being a global exporter.
involv_n: 5.09459 is the average change in the logit when involv_n changes by 1 unit.
grad_n: 8.83055 is the average change in the logit when grad_n changes by 1 unit.
Moderation effect
involv_n:grad_n: -16.09100 is the average change in the logit when involv_n:grad_n changes by 1 unit.
It is interesting to see that, taken separately, the number of employees involved in R&D and the number of graduate employees each have a positive impact, while if we consider the number of graduates as a moderator of the number of employees involved in R&D we find a significant negative effect.
ODDS interpretation
For the odds interpretation, we instead look at (exp(β) − 1) × 100: the percentage change in the odds for a one-unit increase in the corresponding predictor.
## (Intercept) countryFRA countryGER countryHUN countryITA
## -62.357708 15.204523 9.145313 -72.542814 134.411796
## countrySPA countryUK fam_ceo1 mkt_innov1 involv_n
## 63.764065 53.854495 -20.847942 126.257303 16213.650236
## grad_n involv_n:grad_n
## 683908.140801 -99.999990
countryHUN = -72.5% decrease in the odds of being in the group of exporting firms (relative to AUT).
countryITA = 134.4% increase in the odds of being in the group of exporting firms (relative to AUT).
fam_ceo1 = -20.8% decrease in the odds of being in the group of exporting firms.
mkt_innov1 = 126.2% increase in the odds of being in the group of exporting firms.
involv_n = 16213% increase in the odds (for a one-unit increase in involv_n) of being in the group of exporting firms.
grad_n = 683908% increase in the odds (for a one-unit increase in grad_n) of being in the group of exporting firms.
involv_n:grad_n = -99.9% decrease in the odds (for a one-unit increase in involv_n:grad_n) of being in the group of exporting firms.
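The percentages above are obtained as (exp(β) − 1) × 100 applied to the logit coefficients; a quick check on two of the estimates reported earlier:

```r
# Coefficients taken from summary(LR) above
b <- c(countryHUN = -1.29254, mkt_innov1 = 0.81650)

round((exp(b) - 1) * 100, 1)
# countryHUN ≈ -72.5 (a 72.5% decrease in the odds)
# mkt_innov1 ≈ 126.3 (a 126.3% increase in the odds)
```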
Pseudo R-squared
Analogous to the R-squared, this index measures the variability explained by the model.
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -1.623946e+03 -1.773262e+03 2.986320e+02 8.420412e-02 1.060067e-01
## r2CU
## 1.440840e-01
My McFadden R-squared is approximately 0.1, which indicates a modest but reasonable fit for survey data of this kind.
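McFadden's pseudo R-squared is simply 1 − llh/llhNull, which we can verify directly from the log-likelihoods printed above:

```r
llh     <- -1623.946   # log-likelihood of the fitted model (from the output above)
llhNull <- -1773.262   # log-likelihood of the intercept-only model

1 - llh / llhNull      # ≈ 0.0842, matching the reported McFadden value
```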
Assumptions
- Linearity assumption
probabilities <- predict((LR), type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "pos", "neg")
#Linearity assumption only for numerical predictors:
numpred <- Logistic %>%
dplyr::select(involv_n,grad_n)
#Bind the logit and tidy the data for the plot
data <- numpred %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "numpred", value = "predictor.value", -logit)
data$logit=remove_outliers(data$logit)
Linearity<-ggplot(data, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~numpred, scales = "free_y")
Linearity
In both graphs there seems to be an approximately linear relationship, even if we still have some outliers that pull our line towards the top.
- No influential values
From the graph we can see two highly influential points. Let's try to remove them and run another model.
# Cooks distance & Leverage for every point
influence <- cooks.distance(LR)
leverage <- hatvalues(LR)
# Remove points that were very influential in our first model
Logistic2 <- Logistic %>%
mutate(cooks_distance = influence, leverage = leverage) %>%
filter(cooks_distance < 0.01) %>%
filter(leverage < 0.05)
Without influential points
LR2 <- glm(Global_exporter ~country +mkt_innov +fam_ceo+involv_n*grad_n, data=Logistic2, family="binomial")
summary(LR2)
##
## Call:
## glm(formula = Global_exporter ~ country + mkt_innov + fam_ceo +
## involv_n * grad_n, family = "binomial", data = Logistic2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2004 -0.9251 -0.6882 1.1612 2.3898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.95582 0.29115 -3.283 0.00103 **
## countryFRA 0.17225 0.29884 0.576 0.56434
## countryGER 0.10820 0.29377 0.368 0.71263
## countryHUN -1.32656 0.43386 -3.058 0.00223 **
## countryITA 0.91648 0.29629 3.093 0.00198 **
## countrySPA 0.55661 0.30002 1.855 0.06356 .
## countryUK 0.50517 0.30474 1.658 0.09737 .
## mkt_innov1 0.77858 0.08778 8.870 < 2e-16 ***
## fam_ceo1 -0.20411 0.09031 -2.260 0.02382 *
## involv_n 10.17644 1.68146 6.052 1.43e-09 ***
## grad_n 11.94331 2.17403 5.494 3.94e-08 ***
## involv_n:grad_n -65.93723 11.30911 -5.830 5.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3534.0 on 2654 degrees of freedom
## Residual deviance: 3196.3 on 2643 degrees of freedom
## AIC: 3220.3
##
## Number of Fisher Scoring iterations: 4
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML
## -1.598138e+03 -1.767003e+03 3.377307e+02 9.556595e-02 1.194473e-01
## r2CU
## 1.623351e-01
Now our model is free from influential points, and even the McFadden coefficient increased.
- Multicollinearity
## GVIF Df GVIF^(1/(2*Df))
## country 1.123999 6 1.009789
## fam_ceo 1.072582 1 1.035655
## mkt_innov 1.041033 1 1.020310
## involv_n 1.678298 1 1.295491
## grad_n 2.951623 1 1.718029
## involv_n:grad_n 2.455627 1 1.567044
## GVIF Df GVIF^(1/(2*Df))
## country 1.138037 6 1.010834
## mkt_innov 1.046030 1 1.022756
## fam_ceo 1.071465 1 1.035116
## involv_n 3.121307 1 1.766722
## grad_n 1.827780 1 1.351954
## involv_n:grad_n 3.081163 1 1.755324
In both cases there is no multicollinearity.
- Requires a large sample size
## [1] 2665 8
Our sample consists of 2665 observations of 8 variables, so the assumption is satisfied.
Supervised Classification
In this case I set as labels the categorical variable regarding the external financing of the firm. My aim is to predict whether an unknown firm will be correctly classified into the group of firms that had external financing or not.
Let’s create a classification subset from our previous dataset:
Classif<-Efige %>%
dplyr::select(external_financing,empl,no_family_n, family_n, skilled_n,
unskilled_n,involv_n,grad_n,foreign_empl_n,core_p,po,f9,f10,f11,
foreign_affiliate,white_n,exp_richness)%>%
mutate(external_financing=as.factor(external_financing))
data1<-imputeMissings::compute(Classif, method = "median/mode")
Classif<-imputeMissings::impute(Classif, object = data1)
head(Classif)
## external_financing empl no_family_n family_n skilled_n unskilled_n involv_n
## 1 2 500 25 0 10 4 200
## 2 1 500 0 0 10 4 10
## 3 2 500 38 2 20 520 30
## 4 2 30 4 2 15 4 1
## 5 2 18 4 3 5 6 0
## 6 2 25 4 2 8 5 1
## grad_n foreign_empl_n core_p po f9 f10 f11 foreign_affiliate white_n
## 1 35 0 100 10 2 60 12 6 4
## 2 10 0 100 80 4 70 20 2 4
## 3 30 0 100 100 2 60 12 2 20
## 4 2 3 100 90 1 60 12 2 5
## 5 1 4 95 100 3 60 12 2 0
## 6 0 5 85 0 3 60 12 2 6
## exp_richness
## 1 0
## 2 50
## 3 40
## 4 6
## 5 10
## 6 3
Let's see how well our KNN classification rule can predict the labels using our features (in this case, almost all the variables).
K-Nearest Neighbours (KNN)
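As a minimal illustration of the KNN rule itself, on made-up points (using the same knn() function from the class package already loaded above): a new observation is assigned the majority label among its k nearest neighbours.

```r
library(class)

# Two well-separated groups of training points
train <- matrix(c(0, 0,
                  1, 1,
                  10, 10,
                  11, 11), ncol = 2, byrow = TRUE)
cl <- factor(c("A", "A", "B", "B"))

# A test point near group A: with k = 3 the majority vote is A
knn(train, test = matrix(c(0.5, 0.5), ncol = 2), cl = cl, k = 3)
```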
set.seed(2009)
random <- sample(1:14759) # Randomly permute the row indices
Classif <- Classif[random,] # Shuffle the rows of Classif
Classif2 <- as.data.frame(lapply(Classif[, 2:17], normalize))
head(Classif2)
## empl no_family_n family_n skilled_n unskilled_n involv_n
## 1 0.01632653 0.0011538462 0.00050 0.0013 0.0003201024 0.0006666667
## 2 0.01020408 0.0007692308 0.00050 0.0006 0.0003201024 0.0010000000
## 3 0.01020408 0.0000000000 0.00025 0.0008 0.0009603073 0.0000000000
## 4 0.03061224 0.0030769231 0.00000 0.0000 0.0000000000 0.0003333333
## 5 0.02244898 0.0000000000 0.00000 0.0010 0.0012804097 0.0000000000
## 6 0.02244898 0.0000000000 0.00025 0.0013 0.0006402049 0.0007000000
## grad_n foreign_empl_n core_p po f9 f10 f11
## 1 2.000002e-06 0.000000e+00 0.50 1.0 0.01020408 0.01 0.02040816
## 2 1.200001e-05 3.000003e-06 0.90 0.8 0.00000000 0.01 0.04081633
## 3 1.000001e-06 5.000005e-06 0.90 1.0 0.00000000 1.00 0.08163265
## 4 1.400001e-05 0.000000e+00 0.95 0.4 0.01020408 0.60 0.11224490
## 5 0.000000e+00 0.000000e+00 1.00 0.8 0.02040816 0.60 0.11224490
## 6 2.100002e-07 0.000000e+00 0.70 0.5 0.00000000 0.60 0.11224490
## foreign_affiliate white_n exp_richness
## 1 0.006666667 0.0003333333 0.006006006
## 2 0.006666667 0.0006666667 0.001001001
## 3 0.006666667 0.0005000000 0.001001001
## 4 0.006666667 0.0028333333 0.010010010
## 5 0.006666667 0.0006666667 0.003003003
## 6 0.006666667 0.0008333333 0.006006006
Efige.train <- Classif2[1:11807,]
Efige.train.label <- Classif[1:11807, 1]
Efige.test <- Classif2[11808:14759,] # 20% as test set
Efige.test.label <- Classif[11808:14759, 1]
Classif2<-na.omit(Classif2)
predict<-knn(train=Efige.train, test=Efige.test, cl=Efige.train.label ,k=2)
Tab <- table(Efige.test.label, predict)
Tab # Confusion matrix
## predict
## Efige.test.label 1 2
## 1 1144 145
## 2 109 1555
SPECIFICITY, SENSITIVITY AND AUC
## Decision Theory and Area under the Curve
##
## The original data implied the following 2 x 2 table
## predict
## Efige.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.387 0.049
## True.Neg 0.037 0.527
##
## Conditional probabilities of
## predict
## Efige.test.label Predicted.Pos Predicted.Neg
## True.Pos 0.888 0.112
## True.Neg 0.066 0.934
##
## Accuracy = 0.91 Sensitivity = 0.89 Specificity = 0.93
## with Area Under the Curve = 0.97
## d.prime = 2.72 Criterion = 1.51 Beta = 0.74
## Observed Phi correlation = 0.82
## Inferred latent (tetrachoric) correlation = 0.96
In our case, both sensitivity and specificity are very high.
The accuracy is pretty high (0.91), meaning that our classifier is very accurate in predicting the labels.
The misclassification error rate is simply 1 - accuracy, so 0.09, meaning that we mislabel only a few observations.
The AUC indicates that our classifier has a very good (0.97) discriminating ability in recognizing firms that had external financing and firms that didn't.
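These figures can be recomputed directly from the confusion matrix above:

```r
# Confusion matrix from the KNN output (rows: true label, columns: predicted)
Tab <- matrix(c(1144, 109,
                145, 1555), nrow = 2)  # filled column by column

accuracy    <- sum(diag(Tab)) / sum(Tab)  # ≈ 0.914
error_rate  <- 1 - accuracy               # misclassification error rate
sensitivity <- Tab[1, 1] / sum(Tab[1, ])  # 1144 / 1289 ≈ 0.888
specificity <- Tab[2, 2] / sum(Tab[2, ])  # 1555 / 1664 ≈ 0.934
```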
Plot Accuracy and Misclassification error rate against neighbours
res=rep(NA,20)
for(i in 1:20){mod=knn(train=Efige.train, test=Efige.test, cl=Efige.train.label, k=i)
acc=sum(Efige.test.label==mod)
res[i]=1-acc/length(mod)}
par(mfrow=c(1,2))
plot(1:20,res,type="l",col="red", xlab="Neighbours", ylab="Misclassification error rate")
plot(1:20,1-res,type="l",col="blue", xlab="Neighbours", ylab="Accuracy")
As the plot shows, as we increase the number of neighbours the misclassification error rate increases and the accuracy decreases. The optimal number of neighbours seems to be around 5, since after that point both curves stabilize.
KNN with 10-fold cross-validation
This approach is based on resampling methods, and it is very useful for implementing the classifier without losing information from our dataset.
TrainData <- Classif[,2:17]
TrainClasses <- Classif[,1]
control <- trainControl(method="cv", number=10)
CrossValid <- train(TrainData, TrainClasses,
method = "knn",
preProcess = c("center", "scale"),
tuneLength = 12,
trControl = control)
## Registered S3 method overwritten by 'e1071':
## method from
## print.fclust fclust
## 5-nearest neighbor model
## Training set outcome distribution:
##
## 1 2
## 6344 8415
## k Accuracy Kappa AccuracySD KappaSD
## 1 5 0.9237736 0.8441902 0.009213652 0.01903260
## 2 7 0.9226217 0.8417639 0.008664146 0.01792613
## 3 9 0.9198434 0.8359510 0.009027340 0.01869233
## 4 11 0.9180819 0.8322377 0.009421195 0.01956279
## 5 13 0.9163877 0.8287219 0.009270391 0.01923180
## 6 15 0.9157103 0.8272645 0.008868868 0.01840080
## 7 17 0.9138806 0.8234187 0.008485834 0.01767168
## 8 19 0.9131353 0.8218497 0.008683706 0.01811487
## 9 21 0.9109671 0.8172811 0.009333064 0.01946255
## 10 23 0.9101538 0.8155804 0.008731181 0.01820062
## 11 25 0.9085954 0.8122963 0.008987108 0.01874480
## 12 27 0.9081212 0.8112831 0.008921858 0.01861707
As expected, the best number of neighbours is 5 (or fewer), since the accuracy declines steadily as k grows.
Clustering
Agglomerative Hierarchical Clustering
The features selected for the unsupervised classification are:
Number of employees;
Blue collar employees;
Number of banks;
Percentage of trained employees.
My aim is to understand whether, according to the selected predictors, we can discover different patterns in our dataframe.
Selecting and pre-processing the data (normalizing predictors and removing outliers).
Clustering<-Efige%>%
dplyr::select(empl,skilled_n,f9,training)%>%
filter(country=="HUN"|country=="AUT"|country=="FRA"|country=="GER") # I chose not to select all countries for visualization purposes, not because I want to find separation among countries!
data2<-imputeMissings::compute(Clustering, method = "median/mode")
Clustering<-imputeMissings::impute(Clustering, object = data2)
Clustering$training<-normalize(Clustering$training)
Clustering$f9<-normalize(Clustering$f9)
Clustering$empl<-normalize(Clustering$empl)
Clustering$skilled_n<-normalize(Clustering$skilled_n)
Clustering$training<-remove_outliers(Clustering$training)
Clustering$empl<-remove_outliers(Clustering$empl)
Clustering$skilled_n<-remove_outliers(Clustering$skilled_n)
Clustering$f9<-remove_outliers(Clustering$f9)
Clustering<-na.omit(Clustering)
head(Clustering)
## empl skilled_n f9 training
## 4 0.04081633 0.0015 0.00000000 0.01
## 5 0.01632653 0.0005 0.02040816 0.00
## 6 0.03061224 0.0008 0.02040816 0.13
## 7 0.01428571 0.0005 0.02040816 0.05
## 9 0.02448980 0.0005 0.01020408 0.15
## 11 0.01020408 0.0006 0.01020408 0.10
In this step we compute the distance matrix, plot the dendrogram, and use the Elbow and Silhouette methods to understand how many clusters we will obtain according to our features.
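As a reminder, the Manhattan distance used below simply sums the absolute coordinate differences between two observations:

```r
# Two toy observations in two dimensions
m <- rbind(a = c(0, 0), b = c(1, 2))

dist(m, method = "manhattan")  # |0 - 1| + |0 - 2| = 3
```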
dd=dist(Clustering, method="manhattan") # Distance matrix
cluster = hclust(dd, method ="average")
plot(cluster)
rect.hclust(cluster, k=2, border = 2:3) # Separate the groups of the dendrogram
plot1 <- fviz_nbclust(Clustering, FUN = hcut, method = "wss",
k.max = 10) +
ggtitle("Elbow method")
plot2 <- fviz_nbclust(Clustering, FUN = hcut, method = "silhouette",
k.max = 10) +
ggtitle("Silhouette method")
Plot <- grid.arrange(plot1, plot2, ncol=2)
Descriptive graph of the two clusters:
cut <- cutree(cluster, k = 2) # cluster assignments from the dendrogram
ggpairs(cbind(Clustering, cut=as.factor(cut)),
columns=1:4, aes(colour=cut, alpha=0.5),
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both")+
theme_bw()
From the dendrogram we can see that our features identify two different clusters. Since we cannot plot in a four-dimensional space, we must use a dimensionality-reduction plot: PCA. This representation is especially useful because it shows the maximum spread of the observations (when plotted on the first and second principal components).
We will try to plot our data on the factorial space along the first two components and see what we will obtain
At first glance we can tell that both groups of firms have a large within-cluster sum of squares and a small between-cluster sum of squares, meaning that observations are not particularly close to one another within their cluster.
The overlap of the two clusters indicates a group of firms that is very similar in terms of employees, blue collars, number of banks and percentage of trained employees.
We can also plot a circular dendrogram for a better visualization of our clusters.
Clusters = hclust(dd,method = "ave")
dend <- as.dendrogram(Clusters)
dend <- color_branches(dend, k=2)
circlize_dendrogram(dend, labels = FALSE)
Unfortunately, crisp clustering forces each observation to belong to one cluster or another, without considering that there could be a midway: fuzzy clustering.
Fuzzy clustering
Fuzzy clustering is another interesting approach to unsupervised classification, in which we take into account the degree of membership, so that an observation can belong to more than one group.
The choice of the number of groups is based on three indexes that should be maximised:
The Modified Partition Coefficient index (MPC)
The Partition Coefficient index (PC)
The Fuzzy Silhouette index (FS)
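For intuition, both PC and MPC can be computed by hand from a membership matrix U (rows are observations, columns are clusters); a toy example with three observations and c = 2 clusters:

```r
U <- rbind(c(0.9, 0.1),   # almost crisp membership
           c(0.2, 0.8),
           c(0.5, 0.5))   # fully fuzzy observation
c_k <- ncol(U)            # number of clusters

PC  <- sum(U^2) / nrow(U)           # Partition Coefficient
MPC <- (c_k * PC - 1) / (c_k - 1)   # Modified Partition Coefficient
c(PC = PC, MPC = MPC)
```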
Clustering=Clustering[,-1]
MPC=rep(NA,10)
for (i in 2:length(MPC)) {clusf=FKM(Clustering,i)
MPC[i]=MPC(clusf$U)}
MPC=MPC[-1]
PC=rep(NA,10)
for (i in 2:length(PC)) {clusf=FKM(Clustering,i)
PC[i]=PC(clusf$U)}
PC=PC[-1]
FS=rep(NA,10)
for (i in 2:length(FS)) {clusf=FKM(Clustering,i)
FS[i]=SIL.F(Clustering,clusf$U)}
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## The default value alpha=1 has been set
## MPC PC FS
## 1 0.7457851 0.8728925 0.8826363
## 2 0.7441771 0.8294514 0.8595656
## 3 0.7468647 0.8101485 0.8525431
## 4 0.7814114 0.8251291 0.8788448
## 5 0.7679537 0.8053658 0.8479812
## 6 0.7742184 0.8064730 0.8465916
Plot all three together to understand the optimal number of clusters:
Data <- cbind(MPC, PC, FS) # Collect the index values for k = 2, ..., 10
x <- c(2:10)
for (i in 1:ncol(Data)) {plot(x, Data[,i], type="l", lty=i, lwd=2, ylim=c(0, max(Data)+0.5), xlim=c(1,10), main="Number of Group Choice", xlab="Number of Groups", ylab="Index Value")
par(new=TRUE)}
legend("topright", c("MPC","PC","SIL.F"), lty=1:3, lwd=2)
max MPC = 8
max SIL.F = 2
max PC = 2
So the optimal choice of clusters is 2
fannyx=fanny(Clustering,2)
fanni=fanny(Clustering, 2, diss = FALSE, memb.exp = 2, metric = "euclidean",
stand = FALSE, maxit = 500)
fviz_cluster(fanni,repel=TRUE)+
theme_minimal()+
scale_fill_brewer(palette = "Set1")
From the plot above we can see that firms in the intersection of the two clusters belong (with different degrees of membership) to both clusters.
Actually, by plotting our observations on the first two principal components we are explaining just 70% of the variability.
Let's now plot the degree-of-membership scatterplot to better visualize how each observation belongs to each cluster:
pc <- PCA(Clustering, ncp = 2 ,scale.unit = TRUE, graph = FALSE)
coo=pc$ind$coord
dataf<-data.frame(coo[,1],coo[,2])
x1=coo[,1]
x2=coo[,2]
colnames(dataf)<-c("x1","x2")
Name=rownames(dataf)
Fuzzyplot1<-ggplot(Clustering, aes(x=x1, y=x2, colour=fanni$membership[,1])) +
geom_point(size=1)+
scale_colour_gradient("",low="green",high="red") +
theme_minimal()+
ggtitle("Fuzzy Clustering - Cluster 1")
Fuzzyplot2<-ggplot(Clustering, aes(x=x1, y=x2, colour=fanni$membership[,2])) +
geom_point(size=1)+
scale_colour_gradient("",low="green",high="red") +
theme_minimal()+
ggtitle("Fuzzy Clustering - Cluster 2")
Fuzzyplot1
The orange-shaded observations represent firms that belong to both clusters.